Session 9: Scraping Interactive Web Pages

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2025-07-17

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, we learn how to hunt down wild data. We will:

  • Learn how to find secret APIs
  • Emulate a browser
  • Focus specifically on step 1 below

Original Image Source: prowebscraper.com

Philipp Pilz via unsplash.com

Request & Collect Raw Data: a closer look

Common Problems

Imagine you wanted to scrape researchgate.net, since it contains self-created profiles of many researchers. However, when you try to get the HTML content:

library(rvest)
read_html("https://www.researchgate.net/profile/Johannes-Gruber-2")
Error in open.connection(x, "rb"): cannot open the connection

If you don’t know what an HTTP error means, you can go to https://http.cat and have the status explained in a fun way. Below I use a little convenience function:

error_cat <- function(error) {
  link <- paste0("https://http.cat/images/", error, ".jpg")
  knitr::include_graphics(link)
}
error_cat(403)

So what’s going on?

  • If something like this happens, the server essentially did not fulfill our request
  • This is because the website has some special requirements for serving the (correct) content. These could be:
    • specific user agents
    • other specific headers
    • login through browser cookies
  • To find out how the browser manages to get the correct response, we can use the Network tab in the inspection tool
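Before emulating the full browser request, it is sometimes enough to fix a single requirement. A minimal sketch (the URL is a placeholder; which headers a site actually checks varies from case to case):

```r
library(httr2)

# First attempt: resend the request with a browser-like user agent.
# Some sites only check this one header; others require cookies or
# additional headers, as shown below for researchgate.net.
resp <- request("https://example.com") |>
  req_user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36") |>
  req_perform()
resp_status(resp)
```

If this still fails, the Network tab tells you what else the browser sends.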

Strategy 1: Emulate what the Browser is Doing

Open the Inspect Window Again:

Strategy 1: Emulate what the Browser is Doing

But this time, we focus on the Network tab:

Here we get an overview of all the network activity of the browser and the individual requests for data that are performed. Clear the network log first and reload the page to see what is going on. Finding the right call is not always easy, but in most cases, we want:

  • a call with status 200 (OK/successful)
  • a document type
  • something that is at least a few kB in size
  • Initiator is usually “other” (we initiated the call by refreshing)

Once you have identified the call, you can right click -> Copy -> Copy as cURL

Refresher: cURL Calls

What is cURL:

  • cURL is a command-line tool (and library) for making HTTP requests.
  • it is widely used for API calls from the terminal.
  • it lists the parameters of a call in a pretty readable manner:
    • the unnamed argument in the beginning is the Uniform Resource Locator (URL) the request goes to
    • -H arguments describe the headers, which are arguments sent with the call
    • -d is the data or body of a request, which is used e.g., for uploading things
    • -o/-O can be used to write the response to a file (otherwise the response is returned to the screen)
    • --compressed asks for a compressed response which is unpacked locally (saves bandwidth)
curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H '[Redacted]' \
  -H 'sec-ch-ua: "Chromium";v="115", "Not/A)Brand";v="99"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Linux"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed

httr2::curl_translate()

  • We have seen httr2::curl_translate() in action yesterday (and the alternative https://curlconverter.com/r-httr2/)
  • It can also convert more complicated API calls that make R look no different from a regular browser
  • (Remember: you need to escape all " characters inside the call. Press ctrl + F to open the Find & Replace tool, put " in the Find field and \" in the Replace field, and go through all matches except the first and last):
library(httr2)
httr2::curl_translate(
"curl 'https://www.researchgate.net/profile/Johannes-Gruber-2' \
  -H 'authority: www.researchgate.net' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' \
  -H 'accept-language: en-GB,en;q=0.9' \
  -H 'cache-control: max-age=0' \
  -H 'cookie: [Redacted]' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  -H 'sec-fetch-dest: document' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-user: ?1' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  --compressed"
)
request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
  req_cookies_set(
    `[Redacted]` = "",
  ) |>
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()
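A convenience worth knowing: called without an argument, curl_translate() reads the call straight from your clipboard (this requires the clipr package), which sidesteps the quote-escaping dance entirely:

```r
library(httr2)

# Copy the call in the browser (right click -> Copy -> Copy as cURL),
# then run curl_translate() with no arguments: it reads the clipboard
# and prints the equivalent httr2 pipeline to the console.
curl_translate()
```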

‘Emulating’ the Browser Request

resp <- request("https://www.researchgate.net/profile/Johannes-Gruber-2") |>
  req_headers(
    authority = "www.researchgate.net",
    accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    `accept-language` = "en-GB,en;q=0.9",
    `cache-control` = "max-age=0",
    cookie = "[Redacted]",
    `sec-ch-ua` = "\"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"",
    `sec-ch-ua-mobile` = "?0",
    `sec-ch-ua-platform` = "\"Linux\"",
    `sec-fetch-dest` = "document",
    `sec-fetch-mode` = "navigate",
    `sec-fetch-site` = "cross-site",
    `sec-fetch-user` = "?1",
    `upgrade-insecure-requests` = "1",
    `user-agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()
resp
<httr2_response>
GET https://www.researchgate.net/profile/Johannes-Gruber-2
Status: 200 OK
Content-Type: text/html
Body: In memory (322666 bytes)

Mission outcome: success

  • the result can be converted into an rvest object and parsed
resp |> 
  resp_body_html() |> 
  html_elements("[data-testid=\"publicProfileStatsCitations\"]") |> 
  html_text()
[1] "174"

Example: anyflip

Goal

  • get information stored as “flipping book”
  • don’t take individual screenshots/pages, but get whole book

Scout

  • hovering over the page, we find this element

    <div class="side-image" style="position:absolute;z-index:0;width:100%;height:100%;top:0;left:0;">
      <img id="c7ARbTjMdA" alt="c7ARbTjMdA" src="../files/large/ae105d2ec767250c5e9a4e2b0a48efaf.webp?1740113612" style="pointer-events: none; width: 100%; height: 100%; position: absolute; top: 0px; left: 0px; transform-origin: 0% 0% 0px; transform: scale(1);" width="100%" height="100%">
    </div>
  • looks easy enough! .side-image img should give us the link!

  • hold down ctrl and click on the link in the inspector

Scout

Can we get the actual file?

dir.create("data/anyflip", showWarnings = FALSE)
out_file <- paste0("data/anyflip/1_ae105d2ec767250c5e9a4e2b0a48efaf.webp")
curl::curl_download(
  url = "https://online.anyflip.com/ogboc/rozd/files/large/ae105d2ec767250c5e9a4e2b0a48efaf.webp?1740113612", 
  destfile = out_file
)
file.size(out_file)
[1] 166322

We can!

Get images

# file_names: the image file names, in order, extracted from the book's 'config' JSON
dl_urls <- paste0("https://online.anyflip.com/ogboc/rozd/files/large/", file_names)
out_files <- paste0("data/anyflip/", sprintf("%03d", seq_along(file_names)), "_", file_names)
curl::multi_download(
  dl_urls,
  out_files
)
# A tibble: 136 × 10
   success status_code resumefrom url   destfile error type  modified           
   <lgl>         <dbl>      <dbl> <chr> <chr>    <chr> <chr> <dttm>             
 1 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 2 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 3 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 4 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 5 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 6 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 7 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 8 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
 9 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
10 TRUE            200          0 http… /home/j… <NA>  imag… 2025-02-21 05:53:27
# ℹ 126 more rows
# ℹ 2 more variables: time <dbl>, headers <list>
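Before converting anything, it is worth checking that every download actually succeeded; multi_download() returns the status of each file, so a simple sanity check looks like this:

```r
res <- curl::multi_download(dl_urls, out_files)

# stop early if any file failed or came back with a non-200 status
stopifnot(all(res$success), all(res$status_code == 200))
```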

Finally: get a single PDF

# convert each image to a pdf
for (page in out_files) {
  magick::image_read(page) |>
    magick::image_write(path = paste0(page, ".pdf"), format = "pdf")
}

# combine into one pdf
title <- read_html("https://anyflip.com/ogboc/rozd/") |> 
  html_elements("title") |> 
  html_text2() |> 
  janitor::make_clean_names()

pdftools::pdf_combine(
  input = paste0(out_files, ".pdf"), 
  output = paste0("data/anyflip/", title, ".pdf")
)
[1] "/home/johannes/Dropbox/Teaching/ess-web-scraping-data-management/09_Interactive_Web_Pages/data/anyflip/rayngan_sthankarn_thang_sangkhm_ssw_11_pi_2567_tpso_11_flip_pdf_any_flip.pdf"

Mission outcome: success!

To summarise the steps:

  1. poked around the website with the inspect tool
  2. found an image link that we could download
  3. realised the link was not actually in the HTML source, but must come from somewhere
  4. found the request that loads the ‘config’ JSON, containing all the image names in order
  5. guessed the format of the image links from the one link we got + the image names
  6. downloaded the files -> success!
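Steps 4 and 5 can be sketched roughly as follows. Both the config URL and the regular expression are assumptions for illustration; locate the real request in your own Network tab:

```r
library(httr2)

# Hypothetical path to the 'config' script as observed in the Network tab
config <- request("https://online.anyflip.com/ogboc/rozd/mobile/javascript/config.js") |>
  req_perform() |>
  resp_body_string()

# the response is JavaScript rather than pure JSON, so instead of parsing
# it properly, we pull out everything that looks like an image file name
file_names <- unlist(regmatches(config, gregexpr("[a-f0-9]{32}\\.webp", config)))
```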

Extra: getting the text data?

  • since this is a pdf that was made from combining images, we have no actual text data
  • we can use Optical Character Recognition (OCR, an early form of visual ML model)
pdftools::pdf_ocr_text(
  "anyflip/002_ca317b05c3eeca7309063ce966b4b933.webp.pdf", 
  language = "tha"
)
Converting page 1 to 002_ca317b05c3eeca7309063ce966b4b933.webp_1.png... done!
[1] "ค ํ า น ํ า\n\nส ํ า น ั ก ง า น ป ล ั ด ก ร ะ ท ร ว ง ก า ร พ ั ฒ น า ส ั ง ค ม แล ะ ค ว า ม ม ั น ค ง ขอ ง ม น ุ ษ ย ์ ได ้ ม อ บ ห ม า ย ให ้\nส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ ส น ั บ ส น ุ น ว ิ ชา ก า ร 1-11 (ส ส ว . 1-11) ด ํ า เน ิ น ก า ร โค ร ง ก า ร จ ั ด ท ํ า ร า ย ง า น ส ถา น ก า ร ณ์ ท า ง\nส ั ง ค ม ใน ร ะ ด ั บ ก ล ุ ่ ม จ ั ง ห ว ั ด เพ ื ่ อ ร ว บ ร ว ม ว ิ เค ร า ะ ห ์ ข้ อ ม ู ล ส ถา น ก า ร ณ์ ท า ง ส ั ง ค ม ท ี ่ เก ิ ด ขึ ้ น ใน เข ต ร ั บ ผิ ด ชอบ ขอ ง\nส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ ส น ั บ ส น ุ น ว ิ ชา ก า ร 11 ใน ก า ร ค า ด ก า ร ณ์ แน ว โน ้ ม ส ถา น ก า ร ณ์ ท า ง ส ั ง ค ม แล ะ ผล ก ร ะ ท บ\nพ ร ้ อ ม ท ั ้ ง เส น อ แน ะ แน ว ท า ง ใน ก า ร แก ้ ไข ป ั ญ ห า ท า ง ส ั ง ค ม ใน พ ื ้ น ท ี ่ ก ล ุ ่ ม จ ั ง ห ว ั ด ใน เข ต พ ื ้ น ท ี ่ ร ั บ ผิ ด ชอบ ขอ ง\nส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ ส น ั บ ส น ุ น ว ิ ชา ก า ร 11\n\nส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ ส น ั บ ส น ุ น ว ิ ชา ก า ร 11 (ส ส ว .11) ได ้ ด ํ า เน ิ น ก า ร จ ั ด ท ํ า ร า ย ง า น\nส ถา น ก า ร ณ์ ท า ง ส ั ง ค ม ร ะ ด ั บ ก ล ุ ่ ม จ ั ง ห ว ั ด ภา ค ไต ้ ต อ น ล ่ า ง ใน พ ื ้ น ท ี ่ เข ต ร ั บ ผิ ด ชอบ ขอ ง ส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ\nส น ั บ ส น ุ น ว ิ ชา ก า ร 11 ได ้ แก ่ จ ั ง ห ว ั ด ต ร ั ง น ร า ธิ ว า ส ป ั ต ต า น ี พ ั ท ล ุ ง ย ะ ล า ส ง ข ล า แล ะ ส ต ู ล ซึ ่ ง ป ร ะ ก อ บ ด ้ ว ย\n6 ส ่ ว น ได ้ แก ่ ส ่ ว น ท ี ่ 1 บ ท น ํ า ส ่ ว น ท ี ่ 2 ข้ อ ม ู ล พ ื ้ น ฐา น ใน พ ื ้ น ท ี ่ ก ล ุ ่ ม จ ั ง ห ว ั ด ส ่ ว น ท ี ่ 3 ส ถา น ก า ร ณ์ ก ล ุ ่ ม เป ้ า ห ม า ย\nท า ง ส ั ง ค ม ร ะ ด ั บ ก ล ุ ่ ม จ ั ง ห ว ั ด ส ่ ว น ท ี ่ 4 ส ถา น ก า ร ณ์ เช ิ ง ป ร ะ เด ็ น ท า ง ส ั ง ค ม แล ะ ส ถา น ก า ร ณ์ เร ่ ง ด ่ ว น (| ๐ 1 |55 น ๑ 5)\nใน ร ะ ด ั บ ก ล ุ ่ ม จ ั ง ห ว ั ด ส ่ ว น ท ี ่ 5 ก า ร ว ิ เค ร า ะ ห ์ แน ว โน ้ ม ขอ ง ส ถา น ก า ร ณ์ ท า ง ส ั ง ค ม ก ล ุ ่ ม จ ั ง ห ว ั ด ส ่ ว น ท ี ่ 6 บ ท ส ร ุ ป\nแล ะ ข้ อ เส น อ แน ะ\n\nผู ้ จ ั ด ท ํ า ห ว ั ง เป ็ น อ ย ่ า ง ย ิ ง ว ่ า ร า ย ง า น ส ถา น ก า ร 
ณ์ ท า ง ส ั ง ค ม ก ล ุ ่ ม จ ั ง ห ว ั ด ภา ค ใต ้ ต อ น ล ่ า ง\nป ร ะ จ ํ า ป ี 2567 ฉบับ น ี ้ จ ะ เป ็ น ป ร ะ โย ชน ์ ต ่ อ ห น ่ ว ย ง า น ร ะ ด ั บ ท ้ อ ง ถิ ่ น แล ะ ร ะ ด ั บ จ ั ง ห ว ั ด ส า ม า ร ถ น ํ า ข้ อ ม ู ล ใน\nพ ื ้ น ท ี ่ ไป ใช ้ ใน ก า ร ก ํ า ห น ด น โย บ า ย แผ น ง า น โค ร ง ก า ร ใน ก า ร ค ุ ้ ม ค ร อ ง ป ้ อ ง ก ั น แล ะ แก ้ ไข ป ั ญ ห า ท า ง ส ั ง ค ม ใน\nร ะ ด ั บ พ ื ้ น ท ี ่ แล ะ ห น ่ ว ย ง า น ร ะ ด ั บ ก ร ะ ท ร ว ง ส า ม า ร ถ น ํ า ข้ อ ม ู ล ใน ภา พ ร ว ม ไป ใช ้ ป ร ะ โย ชน ์ ว ิ เค ร า ะ ห ์ ส ถา น ก า ร ณ์\nป ั ญ ห า ท า ง ส ั ง ค ม ท ี ่ ส ํ า ค ั ญ แ ล ะ ก ํ า ห น ด น โย บ า ย แผ น ง า น ใน ก า ร ป ้ อ ง ก ั น แล ะ แก ้ ไข ป ั ญ ห า ส ั ง ค ม ภา พ ร ว ม ต ่ อ ไป\n\nส ํ า น ั ก ง า น ส ่ ง เส ร ิ ม แล ะ ส น ั บ ส น ุ น ว ิ ชา ก า ร 11\nก ร ก ฎา ค ม 2567\nร า ย ง า น ส ถา น ก า ร ณ์ ท า ง ส ั ง ค ม ก ล ุ ่ ม จ ั ง ห ว ั ด ภา ค ไต ้ ต อ น ล ่ า ง ป ร ะ จ ํ า ป ี 2567 | 1\n"

Example: ICA (International Communication Association) 2023 Conference

Goal

  • Let’s say we want to build a database of conference attendance
  • So for each conference website we want to get:
    • Speakers
    • (Co-)authors
    • Paper/talk titles
    • Panel (to see who was in the same ones)

Trying to scrape the programme

  • The page looks straightforward enough!
  • There is a “Conference Schedule” with links to the individual panels
  • The table has a pretty nice class by which we can select it: class="agenda-content"
html <- read_html("https://www.icahdq.org/mpage/ICA23-Program")
Error in open.connection(x, "rb"): cannot open the connection

Let’s Check our Network Tab

  • I noticed a request that takes quite a while and retrieves a relatively large object (~500 kB)
  • Clicking on it opens another window showing the response
  • Wait, is this a json with the entire conference schedule?

Translating the cURL call

curl_translate("curl 'https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/?event_id=JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Pragma: no-cache' \
  -H 'Referer: https://whova.com/embedded/event/JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4%3D/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36' \
  -H 'sec-ch-ua: \"Chromium\";v=\"115\", \"Not/A)Brand\";v=\"99\"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: \"Linux\"' \
  --compressed")
request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/") |>
  req_url_query(
    event_id = "JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4=",
  ) |>
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform()

Requesting the json (?)

ica_data <- request("https://whova.com/xems/apis/event_webpage/agenda/public/get_agendas/") |>
  req_url_query(
    event_id = "JcQAdK91J0qLUtNxOYUVWFMTUuQgIg3Xj6VIeeyXVR4=",
  ) |>
  req_headers(
    Accept = "application/json, text/plain, */*",
    `Accept-Language` = "en-GB,en-US;q=0.9,en;q=0.8",
    `Cache-Control` = "no-cache",
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
  ) |>
  req_perform() |> 
  resp_body_json()
object.size(ica_data) |> 
  format("MB")
[1] "6.7 Mb"

It worked!

Wrangling with Json

  • This json file or the R object it produces is quite intimidating.
  • To get to a certain panel on the fourth day, for example, we have to enter this insane path:
ica_data[["data"]][["agenda"]][[4]][["time_ranges"]][[3]][[2]][[65]][[1]][["sessions"]][[1]] |> 
  lobstr::tree(max_length = 30)
<list>
├─id: 3113186
├─name: "Race, Ethnicity, and Religion: M..."
├─start_time: "09:00"
├─end_time: "10:15"
├─calendar_stime: "2023-05-28 09:00:00"
├─calendar_etime: "2023-05-28 10:15:00"
├─place: "M - Chestnut East"
├─desc: "<br /><br /><b>Papers: </b><br /..."
├─docs: <list>
├─session_order: 791
├─session_feedback_enable: TRUE
├─speaker: <list>
├─expand: "yes"
├─speaker_label: "Session chair"
├─type: 1
├─sponsors: <list>
├─programs: <list>
├─tracks: <list>
│ ├─<list>
│ │ ├─name: "In Person"
│ │ ├─id: 539417
│ │ └─color: "#5C6BC0"
│ └─<list>
│   ├─name: "Political Communication"
│   ├─id: 540044
│   └─color: "#a15284"
└─tags: <list>
  • Essentially, someone pressed a relational database into a list format and we now have to scramble to cope with this monstrosity

Parsing the Json

I have not found a better method so far. The only way to extract the data is with a nested for loop that goes through all days and all entries in the object, looking for elements called “sessions”.

library(tidyverse, warn.conflicts = FALSE)
sessions <- list()

for (day in 1:5) {
  
  times <- ica_data[["data"]][["agenda"]][[day]][["time_ranges"]]
  
  for (l_one in seq_along(pluck(times))) {
    for (l_two in seq_along(pluck(times, l_one))) {
      for (l_three in seq_along(pluck(times, l_one, l_two))) {
        for (l_four in seq_along(pluck(times, l_one, l_two, l_three))) {
          
          session <- pluck(times, l_one, l_two, l_three, l_four, "sessions", 1)
          id <- pluck(session, "id")
          if (!is.null(id)) {
            id <- as.character(id)
            sessions[[id]] <- session
          }
          
        }
      }
    }
  }
}
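As an aside, the nested loops can be replaced by a small recursive helper that walks the whole list and collects every element named "sessions" (a sketch, not tested against the full ICA object):

```r
# recursively collect all elements named "sessions" from a nested list
find_sessions <- function(x) {
  if (!is.list(x)) return(list())
  # if this node carries a "sessions" element, return its contents
  if (!is.null(x[["sessions"]])) return(x[["sessions"]])
  # otherwise descend into all children and flatten one level
  unlist(lapply(x, find_sessions), recursive = FALSE)
}

# usage sketch: find_sessions(ica_data[["data"]][["agenda"]])
```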

Parsing the Json data

ica_data_df <- tibble(
  panel_id = map_int(sessions, "id"),
  panel_name = map_chr(sessions, "name"),
  time = map_chr(sessions, "calendar_stime"),
  desc = map_chr(sessions, function(s) pluck(s, "desc", .default = NA))
)
ica_data_df
# A tibble: 881 × 4
   panel_id panel_name                                               time  desc 
      <int> <chr>                                                    <chr> <chr>
 1  3113155 PRECONFERENCE: Games and the (Playful) Future of Commun… 2023… "Rec…
 2  3113156 PRECONFERENCE: Generation Z and Global Communication     2023… "Gen…
 3  3113166 PRECONFERENCE: Nothing About Us, Without Us: Authentic … 2023… "Thi…
 4  3113172 PRECONFERENCE: Reimagining the Field of Media, War and … 2023… "As …
 5  3113175 PRECONFERENCE: The Legacies of Elihu Katz                2023… "Eli…
 6  3112705 Human-Machine Preconference Breakout (room 2)            2023…  <NA>
 7  3113080 New Avoidance Preconference Breakout (room 2)            2023…  <NA>
 8  3113150 PRECONFERENCE: 12th Annual Doctoral Consortium of the C… 2023… "The…
 9  3113154 PRECONFERENCE: Ethics of Critically Interrogating and R… 2023… "The…
10  3113158 PRECONFERENCE: Human-Machine Communication: Authenticit… 2023… "The…
# ℹ 871 more rows

Extracting paper title and authors

Finally we want to parse the HTML in the description column.

ica_data_df$desc[100]
3113023 
"<br /><br /><b>Participants: </b><br /><b><i>(Chairs) </i></b>Wayne Xu, U of Massachusetts Amherst<br /><br /><b>Papers: </b><br />Disentangling the Longitudinal Relationship Between Social Media Use, Political Expression and Political Participation: What Do We Really Know?<br /><i>Jörg Matthes, U of Vienna</i><br /><i>Andreas Nanz, U of Vienna</i><br /><i>Marlis Stubenvoll, U of Vienna</i><br /><i>Ruta Kaskeleviciute, U of Vienna</i><br /><br />Political Discussions on Russian YouTube: How Did They Change Since the Start of the War in Ukraine?<br /><i>Ekaterina Romanova, U of Florida</i><br /><br />Perceptions of and Reactions to Different Types of Incivility in Public Online Discussions: Results of an Online Experiment<br /><i>Marike Bormann, Unviersity of Düsseldorf</i><br /><i>Dominique Heinbach, Heinrich-Heine-U</i><br /><i>Jan Kluck, U of Duisburg-Essen</i><br /><i>Marc Ziegele, Heinrich Heine U</i><br /><br />When Trust in AI Mediates: AI News Use, Public Discussion, and Civic Participation<br /><i>Seungahn Nah, U of Florida</i><br /><i>Chun Shao, Arizona State U</i><br /><i>Ekaterina Romanova, U of Florida</i><br /><i>Gwiwon Nam, U of Florida</i><br /><i>Fanjue Liu, U of Florida</i> <a href='https://ica2023.cadmore.media/object/451094' style='text-decoration: none; background-color: #789F90; color: #FFFFFF; padding: 5px 10px; border: 1px solid #789F90; border-radius: 15px;'>Open Session</a><br /><br />" 

We can inspect HTML content by writing it to a temporary file and opening it in the browser. Below is a function that does this automatically for you:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(ica_data_df$desc[100])

Extracting paper title and authors using a function

I wrote another function for this. You can check some of the panels using the browser: check_in_browser(ica_data_df$desc[100]).

pull_papers <- function(desc) {
  # we extract the html code starting with the papers line
  papers <- str_extract(desc, "<b>Papers: </b>.+$") |> 
    str_remove("<b>Papers: </b><br />") |> 
    # we split the html by double line breaks, since it is not properly formatted as paragraphs
    strsplit("<br /><br />", fixed = TRUE) |> 
    pluck(1)
  
  
  # if there is no html code left, just return NAs
  if (all(is.na(papers))) {
    return(list(list(paper_title = NA, authors = NA)))
  } else {
    # otherwise we loop through each paper
    map(papers, function(t) {
      html <- read_html(t)
      
      # first line is the title
      title <- html |> 
        html_text2() |> 
        str_extract("^.+\n")
      
      # at least the authors are formatted in italics
      authors <- html_elements(html, "i") |> 
        html_text2()
      
      list(paper_title = title, authors = authors)
    })
  }
}

Now we have all the information we wanted:

ica_data_df_tidy <- ica_data_df |> 
  slice(-613) |> 
  mutate(papers = map(desc, pull_papers)) |> 
  unnest(papers) |> 
  unnest_wider(papers) |> 
  unnest(authors) |> 
  select(-desc) |> 
  filter(!is.na(authors))
ica_data_df_tidy
# A tibble: 8,169 × 5
   panel_id panel_name                            time       paper_title authors
      <int> <chr>                                 <chr>      <chr>       <chr>  
 1  3113249 The Powers of Platforms               2023-05-2… "Serve the… Changw…
 2  3113249 The Powers of Platforms               2023-05-2… "Serve the… Ziyi W…
 3  3113249 The Powers of Platforms               2023-05-2… "Serve the… Joel G…
 4  3113249 The Powers of Platforms               2023-05-2… "Empowered… Andrea…
 5  3113249 The Powers of Platforms               2023-05-2… "Empowered… Jacob …
 6  3113249 The Powers of Platforms               2023-05-2… "The Rise … Guy Ho…
 7  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Lucia …
 8  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Mathia…
 9  3113249 The Powers of Platforms               2023-05-2… "Google Ne… Amalia…
10  3112411 Affiliate Journals Top Papers Session 2023-05-2… "One Year … Eloria…
# ℹ 8,159 more rows
ica_data_df_tidy |> 
  filter(!duplicated(paper_title))
# A tibble: 3,277 × 5
   panel_id panel_name                                 time  paper_title authors
      <int> <chr>                                      <chr> <chr>       <chr>  
 1  3113249 The Powers of Platforms                    2023… "Serve the… Changw…
 2  3113249 The Powers of Platforms                    2023… "Empowered… Andrea…
 3  3113249 The Powers of Platforms                    2023… "The Rise … Guy Ho…
 4  3113249 The Powers of Platforms                    2023… "Google Ne… Lucia …
 5  3112411 Affiliate Journals Top Papers Session      2023… "One Year … Eloria…
 6  3112411 Affiliate Journals Top Papers Session      2023… "Digital A… Michel…
 7  3112411 Affiliate Journals Top Papers Session      2023… "Knowledge… Xiao Z…
 8  3112411 Affiliate Journals Top Papers Session      2023… "Stop Stud… Benjam…
 9  3112488 Communication in Interorganizational Coll… 2023… "Towards a… Erich …
10  3112488 Communication in Interorganizational Coll… 2023… "Nonprofit… Sophia…
# ℹ 3,267 more rows
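With the tidy table in hand, building the attendance database is straightforward; for instance, counting how often each author appears on the programme:

```r
library(dplyr)

# number of papers/talks per author across the whole conference
ica_data_df_tidy |>
  count(authors, sort = TRUE)
```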

Exercises 1

First, review the material and make sure you have a broad understanding of:

  • how to look at the requests the browser makes
  • how to copy a curl call
  • how to translate it into R code
  • why we go this route instead of simply using read_html
  1. Open the ICA site in your browser and inspect the network traffic. Can you identify the call to the programme json?
  2. Copy the curl code to R and translate it to get the same result

Example: X-Twitter

Goal

  1. Tweets from a Twitter profile
  2. Get the text, likes, shares and comments

Can we use rvest?

xhtml <- read_html("https://x.com/elonmusk")
Error in open.connection(x, "rb"): cannot open the connection

At least one of these elements should be here!

Probing the hidden/internal API

Translating a request

curl_translate("curl 'https://api.x.com/graphql/Le1DChzkS7ioJH_yEPMi3w/UserTweets?variables=%7B%22userId%22%3A%2244196397%22%2C%22count%22%3A20%2C%22includePromotedContent%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%7D^&features=%7B%22rweb_video_screen_enabled%22%3Afalse%2C%22payments_enabled%22%3Afalse%2C%22profile_label_improvements_pcf_label_in_post_enabled%22%3Atrue%2C%22rweb_tipjar_consumption_enabled%22%3Atrue%2C%22verified_phone_label_enabled%22%3Afalse%2C%22creator_subscriptions_tweet_preview_api_enabled%22%3Atrue%2C%22responsive_web_graphql_timeline_navigation_enabled%22%3Atrue%2C%22responsive_web_graphql_skip_user_profile_image_extensions_enabled%22%3Afalse%2C%22premium_content_api_read_enabled%22%3Afalse%2C%22communities_web_enable_tweet_community_results_fetch%22%3Atrue%2C%22c9s_tweet_anatomy_moderator_badge_enabled%22%3Atrue%2C%22responsive_web_grok_analyze_button_fetch_trends_enabled%22%3Afalse%2C%22responsive_web_grok_analyze_post_followups_enabled%22%3Afalse%2C%22responsive_web_jetfuel_frame%22%3Atrue%2C%22responsive_web_grok_share_attachment_enabled%22%3Atrue%2C%22articles_preview_enabled%22%3Atrue%2C%22responsive_web_edit_tweet_api_enabled%22%3Atrue%2C%22graphql_is_translatable_rweb_tweet_is_translatable_enabled%22%3Atrue%2C%22view_counts_everywhere_api_enabled%22%3Atrue%2C%22longform_notetweets_consumption_enabled%22%3Atrue%2C%22responsive_web_twitter_article_tweet_consumption_enabled%22%3Atrue%2C%22tweet_awards_web_tipping_enabled%22%3Afalse%2C%22responsive_web_grok_show_grok_translated_post%22%3Afalse%2C%22responsive_web_grok_analysis_button_from_backend%22%3Atrue%2C%22creator_subscriptions_quote_tweet_preview_enabled%22%3Afalse%2C%22freedom_of_speech_not_reach_fetch_enabled%22%3Atrue%2C%22standardized_nudges_misinfo%22%3Atrue%2C%22tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled%22%3Atrue%2C%22longform_notetweets_rich_text_read_enabled%22%3Atrue%2C%22longform_notetweets_inline_
media_enabled%22%3Atrue%2C%22responsive_web_grok_image_annotation_enabled%22%3Atrue%2C%22responsive_web_grok_community_note_auto_translation_is_enabled%22%3Afalse%2C%22responsive_web_enhance_cards_enabled%22%3Afalse%7D^&fieldToggles=%7B%22withArticlePlainText%22%3Afalse%7D' \
  --compressed \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0' \
  -H 'Accept: */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br, zstd' \
  -H 'content-type: application/json' \
  -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' \
  -H 'x-guest-token: 1945748957512028451' \
  -H 'x-twitter-client-language: en' \
  -H 'x-twitter-active-user: yes' \
  -H 'x-client-transaction-id: /+X/QU0nFKRUwDDhoNpOxN6hcq7l28ulNkNdcfdFEE3U31BgOAEaTqD7IWkFeJUVWBa71vsj/XOKKVUiqywC4VnCqcTD/A' \
  -H 'x-xp-forwarded-for: 07099bf3db1505b55e501219de247298d63e25c7a99657e84445e2b0b408ca3484c1f8b62fe6e09da632121120e31ebee903a69b84d0b97628cb3249588cf1e752bbe7137fa549048abb4e8a2e6be33b425ca540a3563942559e9547c77a87f2d12c7fdf5f2d55f42a342ce6deb594ccf69563ebf09c2f2f874426a71861d43e81a8ca2e17e44ff85031a512c015e925e74e3c82029c9ca206bc1b56c74eca99b481c137a6be132d8847424a594c6598b2860b54ae015ad05050947034615b147623213dc862af21edf26ec464c6bad4a8b6' \
  -H 'Origin: https://x.com' \
  -H 'Sec-GPC: 1' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://x.com/' \
  -H 'Cookie: guest_id=v1%3A175206553082584585; twtr_pixel_opt_in=Y; __cf_bm=kt7jUopsRK0SahsJuTtuGxY.BWNhKqnaxlCx2CPJF9w-1752737676-1.0.1.1-GUgsMYuLL4uw8QeP5leUchXGdroVhoI8E2HalWnyoDecTcVKHkYfCbzqAUjsodmz48D8flKxs1HsltfUTRYbJPbp9q0spQcCy2.cn56VSXE; gt=1945748957512028451' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-site' \
  -H 'TE: trailers'")
request("https://api.x.com/graphql/Le1DChzkS7ioJH_yEPMi3w/UserTweets") |>
  req_url_query(
    variables = '{"userId":"44196397","count":20,"includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true}^',
    features = '{"rweb_video_screen_enabled":false,"payments_enabled":false,"profile_label_improvements_pcf_label_in_post_enabled":true,"rweb_tipjar_consumption_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"premium_content_api_read_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"responsive_web_grok_analyze_button_fetch_trends_enabled":false,"responsive_web_grok_analyze_post_followups_enabled":false,"responsive_web_jetfuel_frame":true,"responsive_web_grok_share_attachment_enabled":true,"articles_preview_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"responsive_web_grok_show_grok_translated_post":false,"responsive_web_grok_analysis_button_from_backend":true,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_grok_image_annotation_enabled":true,"responsive_web_grok_community_note_auto_translation_is_enabled":false,"responsive_web_enhance_cards_enabled":false}^',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |>
  req_cookies_set(
    guest_id = "v1:175206553082584585",
    twtr_pixel_opt_in = "Y",
    `__cf_bm` = "kt7jUopsRK0SahsJuTtuGxY.BWNhKqnaxlCx2CPJF9w-1752737676-1.0.1.1-GUgsMYuLL4uw8QeP5leUchXGdroVhoI8E2HalWnyoDecTcVKHkYfCbzqAUjsodmz48D8flKxs1HsltfUTRYbJPbp9q0spQcCy2.cn56VSXE",
    gt = "1945748957512028451",
  ) |>
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    `x-guest-token` = "1945748957512028451",
    `x-twitter-client-language` = "en",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "/+X/QU0nFKRUwDDhoNpOxN6hcq7l28ulNkNdcfdFEE3U31BgOAEaTqD7IWkFeJUVWBa71vsj/XOKKVUiqywC4VnCqcTD/A",
    `x-xp-forwarded-for` = "07099bf3db1505b55e501219de247298d63e25c7a99657e84445e2b0b408ca3484c1f8b62fe6e09da632121120e31ebee903a69b84d0b97628cb3249588cf1e752bbe7137fa549048abb4e8a2e6be33b425ca540a3563942559e9547c77a87f2d12c7fdf5f2d55f42a342ce6deb594ccf69563ebf09c2f2f874426a71861d43e81a8ca2e17e44ff85031a512c015e925e74e3c82029c9ca206bc1b56c74eca99b481c137a6be132d8847424a594c6598b2860b54ae015ad05050947034615b147623213dc862af21edf26ec464c6bad4a8b6",
    Origin = "https://x.com",
    `Sec-GPC` = "1",
    TE = "trailers",
  ) |>
  req_perform()
twitter_resp <- request("https://api.x.com/graphql/Le1DChzkS7ioJH_yEPMi3w/UserTweets") |>
  req_url_query(
    variables = '{"userId":"44196397","count":20,"includePromotedContent":true,"withQuickPromoteEligibilityTweetFields":true,"withVoice":true}',
    features = '{"rweb_video_screen_enabled":false,"payments_enabled":false,"profile_label_improvements_pcf_label_in_post_enabled":true,"rweb_tipjar_consumption_enabled":true,"verified_phone_label_enabled":false,"creator_subscriptions_tweet_preview_api_enabled":true,"responsive_web_graphql_timeline_navigation_enabled":true,"responsive_web_graphql_skip_user_profile_image_extensions_enabled":false,"premium_content_api_read_enabled":false,"communities_web_enable_tweet_community_results_fetch":true,"c9s_tweet_anatomy_moderator_badge_enabled":true,"responsive_web_grok_analyze_button_fetch_trends_enabled":false,"responsive_web_grok_analyze_post_followups_enabled":false,"responsive_web_jetfuel_frame":true,"responsive_web_grok_share_attachment_enabled":true,"articles_preview_enabled":true,"responsive_web_edit_tweet_api_enabled":true,"graphql_is_translatable_rweb_tweet_is_translatable_enabled":true,"view_counts_everywhere_api_enabled":true,"longform_notetweets_consumption_enabled":true,"responsive_web_twitter_article_tweet_consumption_enabled":true,"tweet_awards_web_tipping_enabled":false,"responsive_web_grok_show_grok_translated_post":false,"responsive_web_grok_analysis_button_from_backend":true,"creator_subscriptions_quote_tweet_preview_enabled":false,"freedom_of_speech_not_reach_fetch_enabled":true,"standardized_nudges_misinfo":true,"tweet_with_visibility_results_prefer_gql_limited_actions_policy_enabled":true,"longform_notetweets_rich_text_read_enabled":true,"longform_notetweets_inline_media_enabled":true,"responsive_web_grok_image_annotation_enabled":true,"responsive_web_grok_community_note_auto_translation_is_enabled":false,"responsive_web_enhance_cards_enabled":false}',
    fieldToggles = '{"withArticlePlainText":false}',
  ) |>
  req_cookies_set(
    guest_id = "v1:175206553082584585",
    twtr_pixel_opt_in = "Y",
    `__cf_bm` = "kt7jUopsRK0SahsJuTtuGxY.BWNhKqnaxlCx2CPJF9w-1752737676-1.0.1.1-GUgsMYuLL4uw8QeP5leUchXGdroVhoI8E2HalWnyoDecTcVKHkYfCbzqAUjsodmz48D8flKxs1HsltfUTRYbJPbp9q0spQcCy2.cn56VSXE",
    gt = "1945748957512028451",
  ) |>
  req_headers(
    `User-Agent` = "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0",
    Accept = "*/*",
    `Accept-Language` = "en-US,en;q=0.5",
    `Accept-Encoding` = "gzip, deflate, br, zstd",
    `content-type` = "application/json",
    authorization = "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA",
    `x-guest-token` = "1945748957512028451",
    `x-twitter-client-language` = "en",
    `x-twitter-active-user` = "yes",
    `x-client-transaction-id` = "/+X/QU0nFKRUwDDhoNpOxN6hcq7l28ulNkNdcfdFEE3U31BgOAEaTqD7IWkFeJUVWBa71vsj/XOKKVUiqywC4VnCqcTD/A",
    `x-xp-forwarded-for` = "07099bf3db1505b55e501219de247298d63e25c7a99657e84445e2b0b408ca3484c1f8b62fe6e09da632121120e31ebee903a69b84d0b97628cb3249588cf1e752bbe7137fa549048abb4e8a2e6be33b425ca540a3563942559e9547c77a87f2d12c7fdf5f2d55f42a342ce6deb594ccf69563ebf09c2f2f874426a71861d43e81a8ca2e17e44ff85031a512c015e925e74e3c82029c9ca206bc1b56c74eca99b481c137a6be132d8847424a594c6598b2860b54ae015ad05050947034615b147623213dc862af21edf26ec464c6bad4a8b6",
    Origin = "https://x.com",
    `Sec-GPC` = "1",
    TE = "trailers",
  ) |>
  req_perform()

Parsing the Twitter data

This is the code we developed in session 2. We can use it again to get a clean table with some interesting information.

# parse the JSON body of the response into nested R lists
ess_tweets <- twitter_resp |> 
  resp_body_json()

# the timeline entries sit in the third "instructions" element
entries <- pluck(ess_tweets, "data", "user", "result", "timeline", "timeline", "instructions", 3L, "entries")

# the "legacy" object of each entry holds the actual tweet data
tweets <- map(entries, function(x) pluck(x, "content", "itemContent", "tweet_results", "result", "legacy"))

# turn each tweet into a one-row tibble and stack them into one table
tweets_df <- map(tweets, function(t) {
  tibble(
    id = t$id_str,
    user_id = t$user_id_str,
    created_at = lubridate::parse_date_time(t$created_at, "a b d H M S z Y"),
    full_text = t$full_text,
    favorite_count = t$favorite_count,
    retweet_count = t$retweet_count,
    bookmark_count = t$bookmark_count
  )
}) |> 
  bind_rows()
tweets_df
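The order string `"a b d H M S z Y"` above matches the layout of Twitter's `created_at` field. As a quick sanity check, the same parse can be done in base R with `strptime()` (the timestamp below is an invented sample, not taken from the response):

```r
# Twitter timestamps look like "Wed Jul 16 12:34:56 +0000 2025";
# %z handles the "+0000" offset, %a and %b the English day/month names
ts <- strptime("Wed Jul 16 12:34:56 +0000 2025",
               format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
format(ts, "%Y-%m-%d %H:%M:%S")
```

Note that `%a` and `%b` are locale-dependent, so this only works when an English (or C) locale is active.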

Mission failure

I stopped at this point, since three issues remain unresolved:

  1. How do we get more than 98 tweets (i.e., scroll the “cursor”)?
  2. How do we find the user ID?
  3. We have to send several identifiers: it is not clear how stable x-csrf-token, authorization, and the cookies are.
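For issue 1, the scroll position would have to be spliced into the `variables` JSON as an extra field; the field name `"cursor"` is assumed here from similar captured requests, and the helper below is only a hypothetical base-R sketch:

```r
# hypothetical helper: append a "cursor" field to the JSON string that is
# sent as the `variables` query parameter (field name assumed, not verified)
with_cursor <- function(variables, cursor) {
  sub("}$", sprintf(',"cursor":"%s"}', cursor), variables)
}

vars <- '{"userId":"44196397","count":20}'
with_cursor(vars, "DAABCgAB...")
```

Each response contains a cursor entry for the next page, so one would extract it, pass it through `with_cursor()`, and repeat until no new tweets arrive.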

Summary: hidden APIs

What are they?

  • used by services of a company to communicate with each other
  • code on a website often uses one to download additional content
  • the browser logs them and provides them to us as cURL calls

What are they good for?

  • We can often use them to get content that is otherwise unavailable
  • We can study them to find out what requests the website server accepts
  • Some websites allow access just using a special header or cookies
  • If they are somewhat flexible, we can wrap them in a function or package
  • This can allow us to gather data at scale
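Wrapping a hidden API in a function could look like the sketch below: the endpoint and parameter names come from the captured request above, but the token handling is invented here (read from an environment variable), and all other headers and cookies are omitted for brevity:

```r
library(httr2)

# hypothetical wrapper: turn the flexible parts of the hidden API call
# (user ID, tweet count) into function arguments
user_tweets_req <- function(user_id, count = 20,
                            bearer = Sys.getenv("TW_BEARER")) {
  request("https://api.x.com/graphql/Le1DChzkS7ioJH_yEPMi3w/UserTweets") |>
    req_url_query(
      variables = sprintf('{"userId":"%s","count":%d}', user_id, count)
    ) |>
    req_headers(authorization = paste("Bearer", bearer))
  # performing the request is left to the caller via req_perform()
}

# builds (but does not yet send) a request for one user
req <- user_tweets_req("44196397")
```

Separating request construction from `req_perform()` makes the wrapper easy to test and to combine with throttling or retry policies.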

Issues

  • Companies have mechanisms to counter scraping:
    • signing specific requests (TikTok)
    • obscuring pagination (Twitter)
    • rate limiting requests per second/minute/day and user/IP (Twitter)
    • expiring session tokens (telegraaf.nl)
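Rate limits, at least, can be respected on our side rather than tripped: httr2 offers `req_throttle()` to space out requests and `req_retry()` to back off and retry transient failures such as HTTP 429. A minimal sketch (the endpoint is reused from above, the 30-requests-per-minute limit is invented for illustration):

```r
library(httr2)

# politeness policy attached to the request before it is performed
req <- request("https://api.x.com/graphql/Le1DChzkS7ioJH_yEPMi3w/UserTweets") |>
  req_throttle(rate = 30 / 60) |>  # at most 30 requests per minute
  req_retry(max_tries = 3)         # retry transient errors, e.g. HTTP 429
```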

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rvest_1.0.4     httr2_1.2.0     lubridate_1.9.4 forcats_1.0.0  
 [5] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.4     readr_2.1.5    
 [9] tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.1   tidyverse_2.0.0
[13] tinytable_0.8.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6      xfun_0.52         websocket_1.4.4   processx_3.8.6   
 [5] tzdb_0.5.0        vctrs_0.6.5       tools_4.5.1       ps_1.9.0         
 [9] generics_0.1.3    curl_6.4.0        fansi_1.0.6       pkgconfig_2.0.3  
[13] pdftools_3.5.0    tesseract_5.2.3   lifecycle_1.0.4   compiler_4.5.1   
[17] munsell_0.5.1     chromote_0.5.1    janitor_2.2.1     snakecase_0.11.1 
[21] litedown_0.7      htmltools_0.5.8.1 yaml_2.3.10       later_1.4.1      
[25] pillar_1.10.2     crayon_1.5.3      magick_2.8.6      tidyselect_1.2.1 
[29] digest_0.6.37     stringi_1.8.7     fastmap_1.2.0     grid_4.5.1       
[33] colorspace_2.1-1  cli_3.6.5         magrittr_2.0.3    utf8_1.2.6       
[37] withr_3.0.2       scales_1.3.0      promises_1.3.2    rappdirs_0.3.3   
[41] timechange_0.3.0  rmarkdown_2.29    httr_1.4.7        lobstr_1.1.2     
[45] qpdf_1.3.5        askpass_1.2.1     hms_1.1.3         evaluate_1.0.3   
[49] knitr_1.50        rlang_1.1.6       Rcpp_1.0.14       docopt_0.7.2     
[53] glue_1.8.0        selectr_0.4-2     xml2_1.3.8        rstudioapi_0.17.1
[57] jsonlite_2.0.0    R6_2.6.1